Avoiding the Exploration-Exploitation Tradeoff in Contextual Bandits
Authors
Abstract
The contextual bandit literature has traditionally focused on algorithms that address the exploration-exploitation tradeoff. In particular, greedy algorithms that exploit current estimates without any exploration may be sub-optimal in general. However, exploration-free greedy algorithms are desirable in many practical settings where exploration may be prohibitively costly or unethical (e.g., clinical trials). Surprisingly, we find that a simple greedy algorithm can be rate-optimal if there is sufficient randomness in the observed contexts. We prove that this is always the case for a two-armed bandit under a general class of context distributions that satisfy a condition we term covariate diversity. Furthermore, even absent this condition, we show that a greedy algorithm can be rate-optimal with nonzero probability. Thus, standard bandit algorithms may unnecessarily explore. Motivated by these results, we introduce Greedy-First, a new algorithm that uses only observed contexts and rewards to determine whether to follow a greedy algorithm or to explore. We prove that this algorithm is asymptotically optimal without any additional assumptions on the context distribution or the number of arms. Extensive simulations demonstrate that Greedy-First successfully reduces experimentation and outperforms existing (exploration-based) contextual bandit algorithms such as Thompson sampling, UCB, or ε-greedy.
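The exploration-free strategy the abstract describes can be sketched in a few lines: each arm maintains a ridge-regression estimate of its linear reward model, and at every round the arm with the highest predicted reward is pulled, with no exploration bonus. This is a minimal illustrative sketch, not the paper's exact algorithm; the Gaussian contexts, linear reward model, noise level, and all variable names below are assumptions chosen only to show why stochastic contexts (the covariate-diversity condition) can make pure exploitation learn all arms' parameters.

```python
import numpy as np

# Hedged sketch of a greedy (exploration-free) linear contextual bandit.
# Assumed setup: reward of arm k on context x is theta_true[k] @ x + noise.
rng = np.random.default_rng(0)
d, n_arms, horizon, lam = 3, 2, 2000, 1.0

# Unknown true arm parameters (used only to simulate rewards and regret).
theta_true = rng.normal(size=(n_arms, d))

# Per-arm ridge-regression statistics: A_k = lam*I + sum x x^T, b_k = sum r x.
A = np.stack([lam * np.eye(d) for _ in range(n_arms)])
b = np.zeros((n_arms, d))

regret = 0.0
for t in range(horizon):
    x = rng.normal(size=d)  # random context each round (covariate diversity)
    theta_hat = np.stack([np.linalg.solve(A[k], b[k]) for k in range(n_arms)])
    arm = int(np.argmax(theta_hat @ x))  # pure exploitation: best predicted reward
    reward = theta_true[arm] @ x + 0.1 * rng.normal()
    A[arm] += np.outer(x, x)             # update only the pulled arm's statistics
    b[arm] += reward * x
    regret += (theta_true @ x).max() - theta_true[arm] @ x

print(round(regret / horizon, 3))  # average per-round regret
```

Because the contexts themselves vary randomly, each arm is pulled on a diverse set of covariates even without deliberate exploration, so the per-arm estimates converge and the average regret shrinks over time; with degenerate (non-diverse) contexts the same loop can lock onto a suboptimal arm forever.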
Similar Resources
Contextual Bandits: Approximated Linear Bayes for Large Contexts
Contextual bandits, and in general informed decision making, can be studied in the general stochastic/statistical setting by means of the conditional probability paradigm where Bayes’ theorem plays a central role. However, when informed decisions have to be made considering very large contextual information or the information is contained in too many variables with large history of observations...
Algorithms with Logarithmic or Sublinear Regret for Constrained Contextual Bandits
We study contextual bandits with budget and time constraints under discrete contexts, referred to as constrained contextual bandits. The budget and time constraints significantly increase the complexity of exploration-exploitation tradeoff because they introduce coupling among contexts. Such coupling effects make it difficult to obtain oracle solutions that assume known statistics of bandits. T...
Exploration-Free Policies in Dynamic Pricing and Online Decision-Making
Growing availability of data has enabled practitioners to tailor decisions at the individual level. This involves learning a model of decision outcomes conditional on individual-specific covariates or features. Recently, contextual bandits have been introduced as a framework to study these online and sequential decision making problems. This literature predominantly focuses on algorithms that ba...
Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits
In this paper, we propose and study opportunistic bandits, a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when t...
Linear Bayes policy for learning in contextual-bandits
Machine and Statistical Learning techniques are used in almost all online advertisement systems. The problem of discovering which content is more demanded (e.g. receive more clicks) can be modeled as a multi-armed bandit problem. Contextual bandits (i.e. bandits with covariates, side information or associative reinforcement learning) associate, to each specific content, several features that de...
Journal title:
Volume  Issue
Pages  -
Publication date: 2017